This document is a data science report presenting insights into features that can potentially lead to loan defaults.
Version : 0.7
Name : Loan Defaults Prediction Project
Purpose : Predicting if a loan is going to default or not
Date : 2025-03-06
Contributors : Charalambos Pittordis
Description : This work is a data science project that aims to predict whether an individual receiving a loan will end up in default, based on multiple features such as loan_amount, annual_income, purpose of the loan, and home_ownership status.
Source Code : TBC, on github
Origin : All Lending Club Loan Data
Description : 2007 through current Lending Club accepted and rejected loan data
Depth : from 2007 to current date
Perimeter : only residential sales
Target Variable : loan_default
Target Description : loan default = [True, False]
Variable Filtering : All variables containing outliers, and those that required special knowledge or prior calculations to use, were removed
Missing Values : were replaced by the mean of their columns during feature engineering
Feature Engineering : No new features were created. All features were selected carefully; numerical features were transformed via StandardScaler and categorical features via OneHotEncoding. We also applied the Synthetic Minority Oversampling Technique (SMOTE) to oversample classes underrepresented within the one-hot-encoded categorical features, e.g. purposes such as wedding and school.
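The preprocessing described above can be sketched with scikit-learn. This is a minimal illustration, not the project's actual script: the column names and toy values are invented, and the SMOTE step (from the separate imbalanced-learn package) is noted in a comment rather than executed, since it would be applied to the training split only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy data: columns 0-1 numeric (e.g. loan_amount, annual_income),
# column 2 categorical (e.g. purpose). Values are illustrative only.
X = np.array([
    [10000, 55000, "wedding"],
    [np.nan, 72000, "school"],
    [25000, np.nan, "wedding"],
], dtype=object)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # replace missing values by the column mean
    ("scale", StandardScaler()),                 # standardise numerical features
])
categorical = OneHotEncoder(handle_unknown="ignore")  # one-hot encode categories

preprocess = ColumnTransformer([
    ("num", numeric, [0, 1]),
    ("cat", categorical, [2]),
])

Xt = preprocess.fit_transform(X)
# SMOTE (imblearn.over_sampling.SMOTE) would then be fit on the
# transformed *training* data to oversample underrepresented classes.
print(Xt.shape)  # 3 rows; 2 scaled numeric + 2 one-hot columns
```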
Path To Script : TBC, on github
Used Algorithm : We used an XGBClassifier algorithm (XGBoost), but this model could be challenged with other interesting models such as LogisticRegression and Keras deep learning neural networks.
Parameters Choice : We performed hyperparameter optimisation via GridSearchCV and chose n_estimators=200, max_depth=12, learning_rate=0.1, enable_categorical=True, as these parameters gave a good AUC-ROC score with no overfitting.
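The tuning pattern can be sketched as follows. The report tuned an XGBClassifier; here scikit-learn's GradientBoostingClassifier stands in so the sketch runs without xgboost installed, and the synthetic data and grid values are illustrative, not the report's actual grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic binary-classification problem for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Grid over the same parameter names the report tuned
grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.1],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    grid,
    scoring="roc_auc",  # the report selected parameters on AUC-ROC
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```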
Metrics : Accuracy, Precision, Recall (Sensitivity), F1-Score, ROC-AUC, Confusion Matrix
Validation Strategy : We split our data into train (80%) and test (20%)
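An 80/20 split of this kind is one scikit-learn call; a stratified split (an assumption here, not stated by the report) keeps the default rate identical in both sets. The toy labels below are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 non-defaults, 20 defaults
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# 80/20 split; stratify=y preserves the class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_tr), len(X_te))  # 80 20
```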
Path To Script : TBC, on github
Model used : XGBClassifier
Library : xgboost.sklearn
Library version : 2.1.4
Model parameters :
| Parameter key | Parameter value |
|---|---|
| n_estimators | 200 |
| objective | binary:logistic |
| max_depth | 12 |
| max_leaves | None |
| max_bin | None |
| grow_policy | None |
| learning_rate | 0.1 |
| verbosity | None |
| booster | None |
| tree_method | None |
| gamma | None |
| min_child_weight | None |
| max_delta_step | None |
| subsample | None |
| sampling_method | None |
| colsample_bytree | None |
| colsample_bylevel | None |
| colsample_bynode | None |
| reg_alpha | None |
| reg_lambda | None |
| scale_pos_weight | None |
| base_score | None |
| missing | nan |
| num_parallel_tree | None |
| random_state | 42 |
| n_jobs | None |
| monotone_constraints | None |
| interaction_constraints | None |
| importance_type | None |
| device | None |
| validate_parameters | None |
| enable_categorical | True |
| feature_types | None |
| max_cat_to_onehot | None |
| max_cat_threshold | None |
| multi_strategy | None |
| eval_metric | None |
| early_stopping_rounds | None |
| callbacks | None |
| n_classes_ | 2 |
| _Booster | |

| | Training dataset | Prediction dataset |
|---|---|---|
| number of features | 20 | 20 |
| number of observations | 140,036 | 35,010 |
| missing values | 0 | 0 |
| % missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| count | 140,036 | 35,010 |
| mean | -0.0534 | -0.0537 |
| std | 0.864 | 0.796 |
| min | -0.918 | -0.907 |
| 25% | -0.393 | -0.401 |
| 50% | -0.186 | -0.187 |
| 75% | 0.107 | 0.113 |
| max | 125 | 73.6 |

| | Training dataset | Prediction dataset |
|---|---|---|
| count | 140,036 | 35,010 |
| mean | 0.00155 | -0.000824 |
| std | 0.844 | 0.106 |
| min | -0.0812 | -0.0811 |
| 25% | -0.0354 | -0.0354 |
| 50% | -0.00949 | -0.0089 |
| 75% | 0.0223 | 0.0233 |
| max | 271 | 9.83 |

| | Training dataset | Prediction dataset |
|---|---|---|
| count | 140,036 | 35,010 |
| mean | -0.143 | -0.148 |
| std | 0.928 | 0.925 |
| min | -1.79 | -1.65 |
| 25% | -0.906 | -0.906 |
| 50% | -0.338 | -0.37 |
| 75% | 0.277 | 0.277 |
| max | 4.3 | 4.3 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| count | 140,036 | 35,010 |
| mean | 0.225 | 0.226 |
| std | 1.03 | 1.03 |
| min | -1.59 | -1.59 |
| 25% | -0.526 | -0.526 |
| 50% | 0.11 | 0.108 |
| 75% | 0.864 | 0.867 |
| max | 3.68 | 3.68 |

| | Training dataset | Prediction dataset |
|---|---|---|
| count | 140,036 | 35,010 |
| mean | 0.0132 | 0.024 |
| std | 0.965 | 0.977 |
| min | -1.55 | -1.54 |
| 25% | -0.732 | -0.732 |
| 50% | -0.164 | -0.156 |
| 75% | 0.529 | 0.578 |
| max | 2.61 | 2.61 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |

| | Training dataset | Prediction dataset |
|---|---|---|
| count | 140,036 | 35,010 |
| mean | 0.117 | 0.117 |
| std | 1.24 | 1.22 |
| min | -0.13 | -0.13 |
| 25% | -0.13 | -0.13 |
| 50% | -0.13 | -0.13 |
| 75% | -0.13 | -0.13 |
| max | 60.9 | 44.7 |

| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |
Note : the explainability graphs were generated using the test set only.

| | True values | Prediction values |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |
Accuracy : 0.878
Precision : 0.909
Recall : 0.84
F1 Score : 0.873
ROC AUC : 0.878
Confusion Matrix :
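All of the metrics listed above are one call each in scikit-learn. The labels below are a toy example, not the report's test set, so the printed figures do not match the scores above; note also that ROC AUC is normally computed on predicted probabilities (`predict_proba`), while hard 0/1 labels are used here for brevity.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# Toy true labels and predictions for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_pred))  # use probabilities in practice
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
```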